Annotation of Corpora for Research in the Humanities: Proceedings of the ACRH Workshop, Heidelberg, 5 Jan. 2012

Authors

  • Francesco Mambrini
  • Marco Passarotti
  • Caroline Sporleder
  • Kristin Bech
  • Kristine Gunn Eide
  • Fabio Cavulli
  • Christian Girardi
  • Massimo Poesio
Abstract

Preface

In almost every culture of the world, the sciences of language and the study of cultural-heritage documents are inextricably bound. In the Western tradition, as in many others, the need to preserve the literary or historical legacy of the past gave the strongest impetus to the development of formalized grammatical speculation. Since the foundation of linguistics as a discipline, the interaction has proceeded in both directions. Linguistics has profited from the huge amount of material gathered by philologists and historians, along with the full apparatus of concepts and problems that originated from their work. The humanities, in turn, have often seen in linguistics a model of a rigorous scientific approach to a socially and historically complex phenomenon like human language.

It is thus no accident that scholars engaged in historical and literary studies were involved in one of the most original developments in contemporary linguistics, namely the creation and use of the first digital corpora. It is worth remembering that the Index Thomisticus, which is considered the starting point for both corpus linguistics and digital humanities, was designed to allow a more rigorous approach to the philosophy of Thomas Aquinas. Since the time of the first pioneering projects, the concepts and methodologies of corpus linguistics (including the notion of "corpus" itself) have been widely debated; technologies for storing and processing digital information have also changed radically. Nowadays, computational and corpus linguistics have grown into autonomous disciplines, each with its own set of required expertise. As in many other scientific fields, autonomy inevitably means a certain degree of isolation.

The loss of contact between corpus linguistics and the humanities is particularly visible in one crucial aspect. Although quantitative or stylometric approaches to large collections of documents are increasingly frequent in literary or historical studies, the available resources are not quite at the same level as those used by linguists.

The "Workshop on Annotation of Corpora for Research in Humanities" (ACRH) was held at Heidelberg University on January 5th. The event was co-located with the 10th edition of the international workshop on "Treebanks and Linguistic Theories" (TLT-10), also held in Heidelberg on January 6th-7th. The ACRH workshop was conceived to address one special aspect where the aforementioned gap between the two disciplines is particularly visible: the creation and exploitation of annotated corpora for the needs of research …

Similar resources

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

A Comparative Study of Metadiscourse Markers in English and Persian University Lectures

The purpose of this study was to compare metadiscourse markers in forty English and Persian university lectures. Twenty of them were selected from the British Academic Spoken English corpus. The other 20 were selected from an Iranian website (www.maktoobkhane.com). We used Hyland’s (2005) model of metadiscourse. The metadiscourses were collected. Further, the frequency of each type was studied....

Annotated Bibliographical Reference Corpora in Digital Humanities

In this paper, we present new bibliographical reference corpora in digital humanities (DH) that have been developed under a research project, Robust and Language Independent Machine Learning Approaches for Automatic Annotation of Bibliographical References in DH Books supported by Google Digital Humanities Research Awards. The main target is the bibliographical references in the articles of Rev...

Inforex - a collaborative system for text corpora annotation and analysis

We report a first major upgrade of Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. Inforex is part of the Polish CLARIN infrastructure. It is integrated with a digital repository for storing and publishing language resources, and it allows users to visualize, browse and annotate text corpora stored in the repository. As a result of a series of works...

Publication date: 2012